Many universities offer the same graduate program, but it can be difficult for prospective students to determine which program they are most likely to be admitted into. If a prospective student could determine which university they would most likely be admitted into given their scores, they would save both the time spent filling out and sending applications and the money spent on application fees, since there is usually a fee for submitting an application to a university. This would not only make the application process faster but also help alleviate some of the financial burden on prospective students searching for graduate programs.
With this Graduate Admissions data set, I will evaluate the performance of multiple models to determine which one performs best on the data. This is a supervised learning problem, and the models I will evaluate are: Linear Regression, Random Forest, K-Nearest Neighbors (regression), and Decision Tree. I think it is important to use several different types of models so as not to limit the tools available for the analysis. These models will be used through their scikit-learn implementations.
To ensure that the data does not contain points that would skew the models in any particular way, I will remove all records with values more than three standard deviations from the mean before I start modeling. I will also check the values predicted by the models to make sure they fall within the valid range (0 to 1); any value outside that range will be clamped to it, meaning values larger than 1 will be changed to 1 and values smaller than 0 will be changed to 0.
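A minimal sketch of that clamping step, assuming y_pred holds an array of model predictions (the values below are made up purely for illustration); NumPy's clip function does exactly this:
import numpy as np
# Hypothetical predictions, some falling outside the valid [0, 1] range
y_pred = np.array([0.87, 1.04, -0.02, 0.55])
# Clamp every prediction into [0, 1]
y_pred_clamped = np.clip(y_pred, 0, 1)
print(y_pred_clamped)  # [0.87 1.   0.   0.55]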
First, I will determine how the different models perform by splitting the dataset into training and testing sets and using the training set for my initial model analysis. The testing set will be treated as "unseen" future data used to evaluate the best model I select from that analysis. I will evaluate the performance of each model using the adjusted R squared value, where a value closer to one indicates a better model. I will use k-fold cross-validation to check whether my models are underfitting or overfitting the data and make changes accordingly.
Second, after determining which model performs best, I will use that model to predict the chance of admission into the graduate program on the "unseen" test data to see how it performs. The prediction performance will also be evaluated using adjusted R squared, where a value closer to one indicates better predictions.
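For reference, the adjusted R squared used throughout is the standard adjustment of R squared for the number of observations n and the number of features p (this matches the adjusted_R_squared helper defined later in the notebook):
adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)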
For my data set, I will be using a U.S. graduate admissions data set that contains the criteria used to determine whether a student will be admitted into a U.S. graduate program.
Download Location: https://www.kaggle.com/mohansacharya/graduate-admissions
Columns (Higher is better for "Scores" and "Ranks"):
import numpy as np
import pandas as pd
import seaborn as sb
import pandas_profiling as pp
# scipy Libraries
from scipy import stats
from scipy.stats import norm
from scipy import __version__ as scipv
# matplotlib Libraries
import matplotlib.pyplot as plt
from matplotlib import __version__ as mpv
# yellowbrick Libraries
from yellowbrick import __version__ as yb
from yellowbrick.style import set_palette
from yellowbrick.features import ParallelCoordinates
# sklearn Libraries
from sklearn.metrics import r2_score
from sklearn.pipeline import Pipeline
from sklearn.decomposition import PCA
from sklearn import __version__ as skv
from sklearn.tree import DecisionTreeRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
# Library Versions
print('Using version %s of scipy' % scipv)
print('Using version %s of pandas' % pd.__version__)
print('Using version %s of numpy' % np.__version__)
print('Using version %s of sklearn' % skv)
print('Using version %s of seaborn' % sb.__version__)
print('Using version %s of yellowbrick' % yb)
print('Using version %s of matplotlib' % mpv)
print('Using version %s of pandas_profiling' % pp.__version__)
gradData = pd.read_csv('Graduate_Admission_Data/Admission_Predict_Ver1.1.csv')
print("The dimension of the table is: {:,} by {:,}".format(gradData.shape[0], gradData.shape[1]))
gradData.head()
The "Serial No." column is being dropped because it provides no additional student information that each row represents. It simply appears to be a re-index of each row in the dataset, made redundant by the dataframe index itself.
gradData.drop('Serial No.', axis='columns', inplace=True)
gradData.head()
seed = 74 # Seed for train/test split reproduction
x_train, x_test, y_train, y_test = train_test_split(gradData[gradData.columns[:-1]],
gradData['Chance of Admit'],
train_size=0.70,
random_state=seed)
print('x_train head:')
x_train.head()
print('y_train head:')
y_train.head()
before = x_train.shape[0]
print('x_train set:\n\nData size before outlier removal: {:,}'.format(before))
# Remove all records whose value in any column is more than
# three standard deviations away from that column's mean
x_train = x_train[(np.abs(stats.zscore(x_train)) < 3).all(axis=1)]
after = x_train.shape[0]
print(' Data size after outlier removal: {:,}\n\t Total records removed: {:,}'.format(after, before - after))
before = y_train.shape[0]
print('y_train set:\n\nData size before outlier removal: {:,}'.format(before))
# Remove all records whose target value is more than
# three standard deviations away from the mean
y_train = y_train[(np.abs(stats.zscore(y_train)) < 3)]
after = y_train.shape[0]
print(' Data size after outlier removal: {:,}\n\t Total records removed: {:,}'.format(after, before - after))
From the output above we can see that all of the data falls within three standard deviations of each column's mean, so no records were removed from the dataset.
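One caveat: x_train and y_train are filtered independently above, so if any rows had been removed the two sets could have fallen out of alignment. Since nothing was dropped here it makes no difference, but an alternative is to filter both on a single shared mask, sketched below (the x_train_aligned and y_train_aligned names are illustrative):
# Build one boolean mask from the feature columns and apply it to both sets,
# so x_train and y_train always keep exactly the same rows
outlier_mask = (np.abs(stats.zscore(x_train)) < 3).all(axis=1)
x_train_aligned = x_train[outlier_mask]
y_train_aligned = y_train[outlier_mask]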
print("Describe Data: x_train")
x_train.describe()
print("Describe Data: y_train")
y_train.describe()
print('x_train profiling: ')
pp.ProfileReport(x_train).to_notebook_iframe()
print('y_train profiling: ')
pp.ProfileReport(y_train).to_notebook_iframe()
plt.rcParams['figure.figsize'] = (16, 10)
sb.set(font_scale = 1.5)
sb.set_style(style='white')
sb.heatmap(x_train.merge(y_train, left_index=True, right_index=True).corr(),
annot = True).set_title('Annotated Correlation Matrix')
From the Pandas Profiling report, we can see that there is no missing data, that there are 7 numeric variables and 1 categorical variable (Research), and that there are strong correlations between several variables: Chance of Admit and CGPA, CGPA and GRE Score, TOEFL Score and GRE Score, and TOEFL Score and Chance of Admit. These correlations are made clearer by the annotated correlation matrix above.
fig = plt.figure()
fig.subplots_adjust(hspace=0.8, wspace=0.5)
fig.set_size_inches(13.5, 10)
sb.set(font_scale = 1.25)
hists = x_train.merge(y_train, left_index=True, right_index=True)
i = 1
for var in hists.columns:
fig.add_subplot(3, 3, i)
sb.distplot(pd.Series(hists[var], name=''),
fit=norm, kde=False).set_title(var + " Histogram")
plt.ylabel('Count')
i += 1
fig.tight_layout()
plt.rcParams['figure.figsize'] = (16, 10)
sb.set(font_scale = 1.5)
sb.set_style(style='white')
set_palette('sns_bright')
paraGridValues = x_train.merge(y_train, left_index=True, right_index=True)
classes = ['No Research-Exp', 'Has Research-Exp']
columns = paraGridValues.columns.delete(6) # Remove the "Research" column
paraGridValuesNorm = paraGridValues.copy()
for col in columns:
paraGridValuesNorm[col] = ((paraGridValues[col] - paraGridValues[col].min()) /
(paraGridValues[col].max() - paraGridValues[col].min()))
parrCorrData = ParallelCoordinates(classes=classes, features=columns)
parrCorrData.fit_transform(paraGridValuesNorm[columns], paraGridValues['Research'])
parrCorrData.poof()
From the Parallel Coordinates graph above, we can see that, overall, prospective students with research experience had higher scores in each variable category (including admission chance) than those without research experience.
Thus, judging from the graph, students with research experience generally appear to be stronger academic performers and have higher chances of being admitted into their graduate program.
Based on the preprocessing and analysis above, I can see that the data has no missing or duplicated values that would need to be accounted for later in the analysis. There are strong correlations between several of the variables, and no outlying data points were found that would have needed to be removed to prevent them from skewing the analysis results.
features = x_train.values
target = y_train.values
pca = PCA(n_components=len(features[0]))
scaler = StandardScaler()
scaledFeatures = scaler.fit_transform(features)
featurePCA = pca.fit_transform(scaledFeatures, target)
featurePCA_df = pd.DataFrame(featurePCA, columns=['PC_{}'.format(x) for x in range(1, len(featurePCA[0]) + 1)])
print('Features: ', end="")
print(', '.join([col for col in x_train.columns]))
print(' Target: ' + str(y_train.name))
print('\nDataframe of PCA values:')
featurePCA_df.head()
pd.DataFrame(pca.components_, columns=x_train.columns,
index=['PC_{}'.format(x) for x in range(1, len(pca.components_) + 1)])
for index, val in enumerate(pca.explained_variance_ratio_):
print('Principal Component {}: {:>6}%'.format(index + 1, round(val * 100, 3)))
plt.rcParams['figure.figsize'] = (16, 10)
sb.set(font_scale = 1.5)
cumSum = np.cumsum(pca.explained_variance_ratio_) * 100
plt.plot(cumSum, marker='o')
plt.xticks(range(0, len(cumSum)), range(1, 8))
for x, y in zip(range(1, 8), cumSum):
plt.annotate("{:.2f}%".format(y), (x, y), xytext=(-135, -17),
textcoords="offset points", annotation_clip = False)
plt.xlabel('Number of Components')
plt.ylabel('Variance (%)')
plt.title('Explained Variance')
Based on the graph of the principal components above, and given the overall size of the data, I do not believe that using these components for feature reduction will provide any significant benefit to the analysis. While the graph shows that 5 or 6 components would explain roughly 95-98% of the overall variance, my prediction is that, because the dataset is so small and already has few variables, there is not (in my opinion) a large enough benefit to warrant feature reduction techniques here (in this case, PCA). This prediction will be explored in Part 3 of the analysis to determine whether I was correct, and changes will be made accordingly.
features = x_train.values
target = y_train.values
k = len(x_train.columns) # 7
skb = SelectKBest(score_func=f_regression, k=k)
scaler = StandardScaler()
scaledFeatures = scaler.fit_transform(features)
fit = skb.fit(scaledFeatures, target)
df_scores = pd.DataFrame(fit.scores_)
df_columns = pd.DataFrame(x_train.columns)
feature_scores = pd.concat([df_columns, df_scores], axis=1)
feature_scores.columns = ['Feature_Name', 'Score']
print('SelectKBest\n===========\nk = ' + str(k) +
'\nScoring Method: f_regression\n\nDataframe of the SelectKBest scores per feature:')
feature_scores.nlargest(k, 'Score')
plt.rcParams['figure.figsize'] = (16, 10)
sb.set(font_scale = 1.35)
barPlot = feature_scores.nlargest(k, 'Score')
plt.bar(barPlot['Feature_Name'], barPlot['Score'])
for index, value in enumerate(barPlot['Score']):
plt.text(index - 0.2, value, str(round(value, 2)))
plt.ylabel('Scores')
plt.xlabel('Features')
plt.title('Bar Graph of SelectKBest Scores from Largest to Smallest')
Based on the graph above, and according to SelectKBest, the best features to use in my analysis would be CGPA (the most important feature, with the highest score), GRE Score, and TOEFL Score (the higher the SelectKBest score, the more important the feature). Those three features appear to be the important ones to use, while the remaining four features do not seem to have as large an impact on the data, given their much lower SelectKBest scores (with Research having the smallest impact).
Overall, I predict that the computational savings would be marginal at best when conducting the model analysis with or without this feature reduction. However, it is possible that using certain features over others could make my models perform better than they would otherwise. Thus, in Part 3 of this analysis, I will explore these predictions to see whether I was correct and apply the feature reduction accordingly.
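As a small sketch of what that could look like inside a modeling pipeline, keeping only the top 3 features by f_regression score (the reduced_pipe name and k=3 choice are illustrative, and the printed value is an in-sample R squared shown only to demonstrate the mechanics):
# Hypothetical pipeline: scale, keep the 3 highest-scoring features, then fit
# a linear regression on the reduced feature set
reduced_pipe = Pipeline(steps=[
    ('scale', StandardScaler()),
    ('select', SelectKBest(score_func=f_regression, k=3)),
    ('lr', LinearRegression())
])
reduced_pipe.fit(x_train, y_train)
print('In-sample R squared using only the top 3 features:',
      round(reduced_pipe.score(x_train, y_train), 4))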
I will be conducting regression modeling with all 4 of the models I outlined in the proposal (Linear Regression, Random Forest, K-Nearest Neighbor, and Decision Tree), both with and without PCA feature reduction, to see whether there is any improvement in model performance.
Based on the explained variance graph of the PCA components (from Part 2, Step 2), I will use only 5 of the 7 components for the feature reduction, because the first 5 components explain roughly 95% of the variance in the dataset.
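A quick way to confirm that component count from the fitted PCA's cumulative explained variance (a sketch; sklearn's PCA also accepts a fractional n_components such as 0.95 to pick the count automatically):
# Smallest number of components whose cumulative explained variance reaches 95%
cum_var = np.cumsum(pca.explained_variance_ratio_)
n_components_95 = int(np.argmax(cum_var >= 0.95)) + 1
print('Components needed for >= 95% explained variance:', n_components_95)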
The adjusted R squared scores of all the models (both with and without feature reduction) will be compared at the end to determine which model performed the best; that model will then be used to predict future graduate admission chances.
For the hyperparameter selection of each model, my methodology for deciding which parameters to include in each GridSearchCV was to focus mainly on the different algorithms available per model and on the parameters controlling the growth of the model itself (most applicable to KNN, Random Forest, and Decision Tree). I reviewed the sklearn documentation for each model and selected the parameters that fit those criteria, as I felt they would be the most effective in finding the best configuration for each model. While tuning every available parameter would be ideal, I felt it would take too much time and would not be the best use of it, since, in my opinion, not all of the available parameters are equally important (and thus, comparatively, not worth the time).
def adjusted_R_squared(estimator, x, y):
    # Custom scorer for GridSearchCV: adjusts the estimator's R squared for the
    # number of observations (x.shape[0]) and the number of input features (x.shape[1])
    # estimator.score(x, y) returns the R squared of the model
    return round(1 - (1 - estimator.score(x, y)) * ((x.shape[0] - 1) / (x.shape[0] - x.shape[1] - 1)), 5)
lr_pipe = Pipeline(steps=([
('scale', StandardScaler()),
('lr', LinearRegression())
]))
param_grid = {'lr__fit_intercept': [True, False],
'lr__normalize': [True, False]}
lr_grid = GridSearchCV(lr_pipe, scoring=adjusted_R_squared,
param_grid = param_grid, cv = 5, n_jobs = -1, verbose=2)
lr_grid.fit(x_train, y_train)
lr_df = pd.DataFrame(lr_grid.cv_results_).sort_values('mean_test_score',
ascending=False)[['params', 'mean_test_score']].head(10)
lr_df
print('Best Linear Regression Parameters\n=================================')
for name, val in lr_df.iloc[0]['params'].items():
print('{:>19}: {}'.format(name.replace('lr__', ''), val))
lr_adjR2 = lr_df.iloc[0]['mean_test_score']
print('\nAdjusted R Squared: {}'.format(round(lr_adjR2, 4)))
lr_pipe_pca = Pipeline(steps=([
('scale', StandardScaler()),
('pca', PCA(n_components=5)), # Only using 5 components as outlined in the beginning of Part 3
('lr', LinearRegression())
]))
param_grid = {'lr__fit_intercept': [True, False],
'lr__normalize': [True, False]}
lr_grid_pca = GridSearchCV(lr_pipe_pca, scoring=adjusted_R_squared,
param_grid = param_grid, cv = 5, n_jobs = -1, verbose=2)
lr_grid_pca.fit(x_train, y_train)
lr_df_pca = pd.DataFrame(lr_grid_pca.cv_results_).sort_values('mean_test_score',
ascending=False)[['params', 'mean_test_score']].head(10)
lr_df_pca
print('Best Linear Regression Parameters\n=================================')
for name, val in lr_df_pca.iloc[0]['params'].items():
print('{:>19}: {}'.format(name.replace('lr__', ''), val))
lr_adjR2_pca = lr_df_pca.iloc[0]['mean_test_score']
print('\nAdjusted R Squared: {}'.format(round(lr_adjR2_pca, 4)))
rf_pipe = Pipeline(steps=([
('scale', StandardScaler()),
('rf', RandomForestRegressor(random_state=seed))
]))
param_grid = {'rf__max_depth': np.arange(2, 12, 2),
'rf__max_features': ['auto', 'sqrt', 'log2'],
'rf__min_samples_leaf': [1, 2, 4],
'rf__min_samples_split': [2, 5, 10],
'rf__n_estimators': np.append(100, np.arange(200, 1200, 200))}
rf_grid = GridSearchCV(rf_pipe, scoring=adjusted_R_squared,
param_grid = param_grid, cv = 5, n_jobs = -1, verbose=2)
rf_grid.fit(x_train, y_train)
rf_df = pd.DataFrame(rf_grid.cv_results_).sort_values('mean_test_score',
ascending=False)[['params', 'mean_test_score']].head(10)
rf_df
print('Best Random Forest Regression Parameters\n========================================')
for name, val in rf_df.iloc[0]['params'].items():
print('{:>24}: {}'.format(name.replace('rf__', ''), val))
rf_adjR2 = rf_df.iloc[0]['mean_test_score']
print('\nAdjusted R Squared: {}'.format(round(rf_adjR2, 4)))
rf_pipe_pca = Pipeline(steps=([
('scale', StandardScaler()),
('pca', PCA(n_components=5)), # Only using 5 components as outlined in the beginning of Part 3
('rf', RandomForestRegressor(random_state=seed))
]))
param_grid = {'rf__max_depth': np.arange(2, 12, 2),
'rf__max_features': ['auto', 'sqrt', 'log2'],
'rf__min_samples_leaf': [1, 2, 4],
'rf__min_samples_split': [2, 5, 10],
'rf__n_estimators': np.append(100, np.arange(200, 1200, 200))}
rf_grid_pca = GridSearchCV(rf_pipe_pca, scoring=adjusted_R_squared,
param_grid = param_grid, cv = 5, n_jobs = -1, verbose=2)
rf_grid_pca.fit(x_train, y_train)
rf_df_pca = pd.DataFrame(rf_grid_pca.cv_results_).sort_values('mean_test_score',
ascending=False)[['params', 'mean_test_score']].head(10)
rf_df_pca
print('Best Random Forest Regression Parameters\n========================================')
for name, val in rf_df_pca.iloc[0]['params'].items():
print('{:>25}: {}'.format(name.replace('rf__', ''), val))
rf_adjR2_pca = rf_df_pca.iloc[0]['mean_test_score']
print('\nAdjusted R Squared: {}'.format(round(rf_adjR2_pca, 4)))
knn_pipe = Pipeline(steps=([
('scale', StandardScaler()),
('knn', KNeighborsRegressor())
]))
param_grid = {'knn__n_neighbors': np.arange(1, 50, 2),
'knn__weights': ['uniform'],
'knn__algorithm': ['ball_tree', 'kd_tree', 'brute'],
'knn__leaf_size': np.arange(1, 50, 2),
'knn__p': [1, 2]}
knn_grid = GridSearchCV(knn_pipe, refit=True, scoring=adjusted_R_squared,
param_grid = param_grid, cv = 5, n_jobs = -1, verbose=2)
knn_grid.fit(x_train, y_train)
knn_df = pd.DataFrame(knn_grid.cv_results_).sort_values('mean_test_score',
ascending=False)[['params', 'mean_test_score']].head(100)
knn_df
print('Best Knn Regression Parameters\n==============================')
for name, val in knn_df.iloc[0]['params'].items():
print('{:>15}: {}'.format(name.replace('knn__', ''), val))
knn_adjR2 = knn_df.iloc[0]['mean_test_score']
print('\nAdjusted R Squared: {}'.format(round(knn_adjR2, 4)))
knn_pipe_pca = Pipeline(steps=([
('scale', StandardScaler()),
('pca', PCA(n_components=5)), # Only using 5 components as outlined in the beginning of Part 3
('knn', KNeighborsRegressor())
]))
param_grid = {'knn__n_neighbors': np.arange(1, 50, 2),
'knn__weights': ['uniform'],
'knn__algorithm': ['ball_tree', 'kd_tree', 'brute'],
'knn__leaf_size': np.arange(1, 50, 2),
'knn__p': [1, 2]}
knn_grid_pca = GridSearchCV(knn_pipe_pca, scoring=adjusted_R_squared,
refit=True, param_grid = param_grid, cv = 5, n_jobs = -1, verbose=2)
knn_grid_pca.fit(x_train, y_train)
knn_df_pca = pd.DataFrame(knn_grid_pca.cv_results_).sort_values('mean_test_score',
ascending=False)[['params', 'mean_test_score']].head(100)
knn_df_pca
print('Best Knn Regression Parameters\n==============================')
for name, val in knn_df_pca.iloc[0]['params'].items():
print('{:>15}: {}'.format(name.replace('knn__', ''), val))
knn_adjR2_pca = knn_df_pca.iloc[0]['mean_test_score']
print('\nAdjusted R Squared: {}'.format(round(knn_adjR2_pca, 4)))
dt_pipe = Pipeline(steps=([
('scale', StandardScaler()),
('dt', DecisionTreeRegressor(random_state=seed))
]))
param_grid = {'dt__criterion': ['mse', 'friedman_mse', 'mae'],
'dt__splitter': ['best', 'random'],
'dt__max_features': ['auto', 'sqrt', 'log2'],
'dt__max_depth': np.arange(1, 20, 2),
'dt__min_samples_leaf': [1, 2, 4],
'dt__min_samples_split': [2, 5, 10],
'dt__ccp_alpha': [0.0, 1.0]}
dt_grid = GridSearchCV(dt_pipe, refit=True, scoring=adjusted_R_squared,
param_grid = param_grid, cv = 5, n_jobs = -1, verbose=2)
dt_grid.fit(x_train, y_train)
dt_df = pd.DataFrame(dt_grid.cv_results_).sort_values('mean_test_score',
ascending=False)[['params', 'mean_test_score']].head(10)
dt_df
print('Best Decision Tree Regression Parameters\n========================================')
for name, val in dt_df.iloc[0]['params'].items():
print('{:>23}: {}'.format(name.replace('dt__', ''), val))
dt_adjR2 = dt_df.iloc[0]['mean_test_score']
print('\nAdjusted R Squared: {}'.format(round(dt_adjR2, 4)))
dt_pipe_pca = Pipeline(steps=([
('scale', StandardScaler()),
('pca', PCA(n_components=5)), # Only using 5 components as outlined in the beginning of Part 3
('dt', DecisionTreeRegressor(random_state=seed))
]))
param_grid = {'dt__criterion': ['mse', 'friedman_mse', 'mae'],
'dt__splitter': ['best', 'random'],
'dt__max_features': ['auto', 'sqrt', 'log2'],
'dt__max_depth': np.arange(1, 20, 2),
'dt__min_samples_leaf': [1, 2, 4],
'dt__min_samples_split': [2, 5, 10],
'dt__ccp_alpha': [0.0, 1.0]}
dt_grid_pca = GridSearchCV(dt_pipe_pca, refit=True, scoring=adjusted_R_squared,
param_grid = param_grid, cv = 5, n_jobs = -1, verbose=2)
dt_grid_pca.fit(x_train, y_train)
dt_df_pca = pd.DataFrame(dt_grid_pca.cv_results_).sort_values('mean_test_score',
ascending=False)[['params', 'mean_test_score']].head(10)
dt_df_pca
print('Best Decision Tree Regression Parameters\n========================================')
for name, val in dt_df_pca.iloc[0]['params'].items():
print('{:>23}: {}'.format(name.replace('dt__', ''), val))
dt_adjR2_pca = dt_df_pca.iloc[0]['mean_test_score']
print('\nAdjusted R Squared: {}'.format(round(dt_adjR2_pca, 4)))
adj_R_squares = [lr_adjR2, rf_adjR2, knn_adjR2, dt_adjR2]
adj_R_squares_pca = [lr_adjR2_pca, rf_adjR2_pca, knn_adjR2_pca, dt_adjR2_pca]
modelTypes = ['Linear Regression', 'Random Forest', 'K-Nearest Neighbor', 'Decision Tree']
model_r_df = pd.DataFrame(zip(modelTypes, adj_R_squares, adj_R_squares_pca),
                          columns=['Model Type', 'Adj R Squared', 'Adj R Squared PCA'])
model_r_df = model_r_df.nlargest(len(model_r_df), 'Adj R Squared')
model_r_df
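For a quicker visual comparison, the same scores can also be plotted as grouped bars (a small optional sketch using the model_r_df dataframe built above):
# Grouped bar chart of adjusted R squared per model, with and without PCA
model_r_df.plot(x='Model Type', y=['Adj R Squared', 'Adj R Squared PCA'],
                kind='bar', rot=0, figsize=(16, 10),
                title='Adjusted R Squared by Model (With and Without PCA)')
plt.ylabel('Adjusted R Squared')
plt.show()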
After performing model analysis on all 4 model types, with and without PCA feature reduction, and looking at the table of adjusted R squared scores above, the model with the best performance on this data is the random forest regression without PCA feature reduction.
The table above shows that for all model types except one, PCA feature reduction did not improve performance. The only exception is the decision tree regression, which improved slightly (~0.0074) over the decision tree without PCA. All other models had a notable decrease in performance with PCA, with drops ranging between ~0.0125 and ~0.042 across the three other models.
Thus, based on all of this information and analysis, the model I will use for predicting future graduate admission chances will be the random forest model without PCA feature reduction.
print('Best Random Forest Regression Parameters\n' + '='*40)
params = {}
for name, val in rf_df.iloc[0]['params'].items():
name = name.replace('rf__', '')
params.update({name: val})
print('{:>24}: {}'.format(name, val))
print('\nAdjusted R Squared: {}'.format(round(rf_df.iloc[0]['mean_test_score'], 4)))
best_model = Pipeline(steps=([
('scale', StandardScaler()),
('rf', RandomForestRegressor(**params, random_state=seed))
]))
best_model = best_model.fit(x_train, y_train)
best_model
y_pred = best_model.predict(x_test)
# Adjusted R squared on the test set (p = number of feature columns in x_test)
best_model_score = round(1 - (1 - r2_score(y_test, y_pred)) *
                         ((y_test.shape[0] - 1) / (y_test.shape[0] - x_test.shape[1] - 1)), 4)
print("Best Random Forest model score using the test data\n" + '='*50 +
"\nAdjusted R Squared: {}".format(round(best_model_score, 4)))
print('\nDifference between experiment and best model adjusted R squared scores: {}'
.format(round(best_model_score - rf_adjR2, 4)))
Since the adjusted R squared value is so close to the value I obtained during my experiments, I am confident that the model I have selected will perform well on future, unseen data.
final_model = Pipeline(steps=([
('scale', StandardScaler()),
('rf', RandomForestRegressor(**params, random_state=seed))
]))
final_model.fit(gradData[gradData.columns[:-1]], gradData['Chance of Admit'])
Based on all of my analysis and experimentation, I am confident that the final model I have created is the best-performing model for making predictions on the graduate admission chances of future prospective students.
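As a hypothetical usage sketch, reusing the first dataset row as a stand-in for a future applicant's scores (the new_applicant name is illustrative only):
# Score a "new" applicant whose features follow the same column order as the
# training data; the first dataset row is reused here purely as an example
new_applicant = gradData[gradData.columns[:-1]].iloc[[0]]
predicted_chance = final_model.predict(new_applicant)
# Clamp to the valid [0, 1] range, as described in the preprocessing plan
predicted_chance = np.clip(predicted_chance, 0, 1)
print('Predicted chance of admission: {:.1%}'.format(predicted_chance[0]))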